Abstract

Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological 
samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by 
coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high throughput nature, 
shotgun proteomics faces challenges with respect to the analysis and interpretation of experimental data. Among 
such challenges, the identification of proteins present in a sample has been recognized as an important 
computational task. This task generally consists of (1) assigning experimental tandem mass spectra to peptides 
derived from a protein database, and (2) mapping assigned peptides to proteins and quantifying the confidence of 
identified proteins. Protein identification is fundamentally a statistical inference problem with a number of methods 
proposed to address its challenges. In this review we categorize current approaches into rule-based, combinatorial 
optimization and probabilistic inference techniques, and present them using integer programming and Bayesian 
inference frameworks. We also discuss the main challenges of protein identification and propose potential solutions 
with the goal of spurring innovative research in this area. 



Introduction 

The main objective of mass spectrometry-based proteo- 
mics is to provide a molecular snapshot of the form (e.g. 
splice isoforms, post-translational modifications), abun- 
dance level, and functional aspects (e.g. protein-protein 
interactions, protein localization) of each protein in a 
biological sample [1-3]. Among proteomics strategies, 
bottom-up or shotgun proteomics has emerged as a 
high-throughput technology capable of characterizing 
hundreds of proteins at the same time. In this scenario, 
proteins in a sample are first digested into peptides, typi- 
cally using site-specific proteolytic enzymes (e.g. trypsin). 
Peptides are then separated by liquid chromatography 
(LC) and analyzed by tandem mass-spectrometry (MS/ 
MS) resulting in a set of MS/MS spectra [4]. In contrast 
to the top-down proteomics strategy, where intact pro- 
teins are directly analyzed through mass spectrometers, 
shotgun proteomics is characterized by high separation 
efficiency and mass spectral sensitivity. At the same time, 
it places higher demands on the computational and sta- 
tistical techniques necessary for peptide identification, 
protein identification, and label-free quantification. 
In a standard computational pipeline, MS/MS spectra 
from a mass spectrometer are searched against spectral 
libraries [5-8] and/or in silico spectra [9-14] corresponding 
to peptides from a protein database in order to provide 
peptide-spectrum matches (PSMs). Such a database search, 
depending on the parameters of the search and the MS/ 
MS platform, can result in a large number of PSMs that 
are assigned scores indicating the confidence level of cor- 
rect identification of the respective peptide. The next step 
is to assemble a list of identified proteins from all, or a 
subset of, PSMs and provide statistical confidence levels 
for each protein. 

Protein identification is a special case of label-free pro- 
tein quantification because, in an ideal scenario, each 
protein with a correctly inferred non-zero quantity 
(abundance) would be considered identified. However, 
label-free quantification has not yet reached the accuracy 
needed for the wide dynamic range of quantities observed 
in cellular or extracellular proteomes [15]. In addition, in 
many practical situations it suffices to only consider the 
existence of proteins in the sample and not their exact 
quantity. Thus, solving the more general and significantly 
more difficult problem of quantification to provide a 
solution to its subproblem may result in less accurate 
solutions to protein identification. 



Obtaining a list of identified proteins from a set of peptide 
sequences with identification scores may seem 
straightforward. However, there are several factors that 
combine to challenge such intuition: (1) Usually only a 
small number of peptide identifications, mostly unreli- 
able, are available for each protein [16]. This is because 
only the top-scoring PSMs for each peptide are typically 
included into the candidate set for peptide identifications, 
and among those candidates only a small subset are con- 
sidered to be confident identifications. This leads to diffi- 
culties in providing confident protein identifications, e.g. 
if only a single peptide is identified from a protein. (2) 
Peptides, even those from the same protein, are not 
equally likely to be identified in a proteomics experiment 
[17-19]. The probability that a peptide is identified in a 
standard proteomics experiment has been referred to as 
peptide detectability [19], see additional file 1. (3) Many 
peptide sequences encountered in a typical proteomics 
workflow can be mapped to more than one protein in a 
database. These are referred to as degenerate or shared 
peptides [20,21]. It is a common situation that a eukaryo- 
tic sample contains more degenerate than unique 
peptides, i.e. peptides that can be mapped to only one 
protein. (4) It is non-trivial to estimate the false discovery 
rates (FDRs) of identified peptides and proteins. Some 
approaches to estimating peptide-level FDRs involve con- 
struction of decoy databases or use unsupervised estima- 
tion of class-conditional distributions (distributions of 
PSM scores given correct and false identifications, 
respectively). However, a large number of low-scoring 
PSMs may create difficulties in determining the certainty 
of both peptide and protein identification. While meth- 
ods for the estimation of peptide-level FDRs have been 
well-characterized, computing protein-level FDRs 
remains an open problem [22,23]. 
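
To make the decoy-based estimate concrete, the following minimal Python sketch assumes a search against a concatenated target and decoy database, where every PSM carries a score and a flag saying whether it matched a decoy sequence. The function names and the toy data are hypothetical and are not taken from any of the cited tools; the estimate used here (decoy hits divided by target hits above a threshold) is one common convention among several.

```python
def decoy_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs; this input layout is an
    assumption of the sketch, not a format used by any specific tool.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def threshold_for_fdr(psms, max_fdr):
    """Return the lowest observed score whose estimated FDR is <= max_fdr.

    Returns None if no observed score satisfies the requirement.
    """
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if decoy_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Toy example: six PSMs, two of which hit decoy sequences.
example_psms = [(50, False), (45, False), (40, True),
                (35, False), (30, False), (20, True)]
print(decoy_fdr(example_psms, 30))
print(threshold_for_fdr(example_psms, 0.25))
```

The sketch operates on PSMs only. Applying the same counting to proteins requires a rule for scoring each protein, which is exactly where the open problem of protein-level FDR arises.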

The process of identifying proteins that are present in a 
biological sample is now widely framed as a statistical 
inference problem, and has been referred to as the 
protein inference problem [20,21]. To date, a number of 
approaches have been proposed to address this problem 
[20,35-37]. We categorize those approaches into three 
broad groups, noting that a particular method may 
exploit more than one strategy: 

1. Rule-based strategies - methods that rely on a 
relatively small set of confidently identified (unique) 
peptides that are subsequently assigned to proteins. 

2. Combinatorial optimization algorithms - methods 
that rely on constrained optimization formulations of 
the protein inference problem resulting, for example, 
in minimal protein lists that cover some or all confi- 
dently identified peptides. 

3. Probabilistic inference algorithms - methods that 
formulate the problem probabilistically and assign 
identification probabilities for each protein in a 
database. 

In the following sections, we provide justification for 
the development of advanced protein inference algo- 
rithms and then review the major computational strate- 
gies. All combinatorial optimization techniques are 
presented using a framework of integer programming; on 
the other hand, probabilistic algorithms are summarized 
using Bayesian inference principles. Our focus is also on 
the intuition behind the algorithms, the types of solutions 
generated, and the strengths and limitations of each 
method. We believe this information is essential in order 
to understand commonalities among the algorithms as 
well as their principal differences. It is also important for 
the proper interpretation of outputs from the various 
protein inference tools already applied in bottom-up 
proteomics. 

Notation 

Before discussing algorithmic details, it is important to 
introduce notation that will be used throughout this 
paper. Let us consider a set of tandem mass spectra $\mathcal{S}$ 
from a proteomics experiment and let $\mathcal{P} = \{P_1, P_2, \ldots\}$ 
be a database of proteins that the spectra are searched 
against. Let also $\mathbf{p} = \{p_1, p_2, \ldots\}$ be the set of all 
peptides in the database and, similarly, let $\mathbf{p}_i$ be the set of 
peptides that belong to protein $P_i$. We now define two sets 
of indicator variables as follows 

$$t_j = \begin{cases} 1 & \text{if peptide } p_j \text{ is confidently identified} \\ 0 & \text{otherwise} \end{cases}$$

and 

$$x_j = \begin{cases} 1 & \text{if peptide } p_j \text{ is present in the sample} \\ 0 & \text{otherwise} \end{cases}$$

Confident peptide identifications can be determined in 
several ways. Typically, a strict FDR threshold is applied to 
the top-scoring PSMs (per peptide), with the FDR estimated 
using a decoy database [22] or tools such as PeptideProphet 
[38], which calculates the posterior probability of a 
correct peptide identification. When posterior probabilities 
are available, stringent thresholds (e.g. 0.90) can be 
applied directly to those probabilities. Alternatively, suffi- 
ciently high scores from various search engines [9,39-42] 
are sometimes used to select confident identifications. 

It is important to mention that variables $t_j$ and $x_j$ are 
different. For example, a peptide $p_j$ that is confidently 
identified, e.g. using an FDR threshold of 0.01, will 
result in setting $t_j = 1$. On the other hand, $x_j$ can be seen 
as a hidden variable that is to be inferred. Accordingly, 
$P(x_j = 1 \mid \mathcal{S})$ refers to the probability that peptide $j$ is 
present in the sample given all the data from the mass 
spectrometer. A set of confidently identified peptides, 
using any of the above-mentioned approaches, will be 
denoted as $\mathcal{C} = \{p_j \mid t_j = 1\}$. 

In some situations it will be necessary to consider 
peptides with explicit designations of their parent proteins. 
In those cases, the $j$-th peptide derived from protein 
$P_i$ will be denoted as $p_{ij}$. Two or more such 
peptides will be allowed to have identical amino acid 
sequences. For example, peptides $p_{ij}$ and $p_{kl}$ ($i \neq k$) with 
identical amino acid sequences will be referred to as 
degenerate peptides. In the context of protein inference, 
peptides that occur multiple times only within a single 
protein will not be considered degenerate. Finally, we 
define 

$$y_i = \begin{cases} 1 & \text{if protein } P_i \text{ is present in the sample} \\ 0 & \text{otherwise} \end{cases}$$

Variable $y_i$ can be seen as an equivalent of $x_j$ at the 
protein level. Thus, $P(y_i = 1 \mid \mathcal{S})$ is the posterior probability 
that protein $P_i$ is present in the sample. The summary 
of notation and abbreviations is shown in Table 1. 

Protein inference: significance and algorithms 

Our first goal is to investigate the influence of degener- 
ate peptides and to show that their presence is often a 
major factor contributing to the challenges in protein 
inference. We analyze several cellular and serum sam- 
ples and characterize the peptide identification process. 
The data include cell line and plasma samples from 
Homo sapiens [16], a tissue sample from Mus musculus 
[43], as well as samples from Saccharomyces cerevisiae 
[44] and Deinococcus radiodurans [24]. The sets of spec- 
tra were searched using MASCOT [39] against the 
human IPI database (v3.35), mouse IPI database (v3.35), 
Saccharomyces Genome Database (R63, 05-Jan-2010), 
and D. radiodurans proteins extracted from GenBank 
(27-Aug-2009), respectively. 

Figure 1A shows the percentage of identified peptides 
per protein for an FDR of 0.01 (on the unique peptide 
level) when using a reversed database as decoy. We 
observe that 32-63% of proteins are covered by only one 
confidently identified peptide, while 5-36% of proteins 
are covered by five peptides or more. Figure 1B shows 
the percentage of degenerate peptides in each sample. 
The results indicate that 57-68% of peptides in human 
and mouse samples are degenerate, regardless of the 
type of biological sample (e.g. cell line vs. tissue vs. 
plasma). On the other hand, the yeast and D. radiodurans 
data sets contain only 18% and 1% of degenerate 
peptides, respectively. Figure 1C provides the percentage 
of candidate proteins hit by unique peptides. In mouse 
and human samples more than 80% of candidate proteins 
are identified only with degenerate peptides. This 
percentage decreases to 23% for yeast and 3% for D. 
radiodurans. Finally, in Figure 1D we provide the percentage 
of protein groups of a particular size, where a 
group is formed from the set of proteins that are hit by 
exactly the same peptides. In accordance with previous 
results, most of the yeast and D. radiodurans candidate 
proteins are distinguishable; however, for human and 
mouse samples, between 30% and 50% of protein groups 
contain multiple proteins. 
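
The two quantities behind this analysis, degenerate peptides and groups of proteins hit by exactly the same peptides, can be computed from a peptide-to-protein mapping. The Python sketch below is only an illustration under that assumption; the function names and the toy mapping are hypothetical and do not reproduce the pipeline used to generate Figure 1.

```python
from collections import defaultdict


def degenerate_peptides(peptide_to_proteins):
    """Return the peptides that map to more than one protein."""
    return {pep for pep, prots in peptide_to_proteins.items()
            if len(prots) > 1}


def protein_groups(peptide_to_proteins):
    """Group proteins that are hit by exactly the same set of peptides."""
    # Invert the mapping: protein -> set of peptides that hit it.
    protein_to_peptides = defaultdict(set)
    for pep, prots in peptide_to_proteins.items():
        for prot in prots:
            protein_to_peptides[prot].add(pep)
    # Proteins sharing an identical peptide set fall into one group.
    groups = defaultdict(set)
    for prot, peps in protein_to_peptides.items():
        groups[frozenset(peps)].add(prot)
    return sorted(sorted(group) for group in groups.values())


# Toy mapping: pepB and pepC are degenerate; P3 and P4 are indistinguishable.
mapping = {
    "pepA": {"P1"},
    "pepB": {"P1", "P2"},
    "pepC": {"P3", "P4"},
}
print(sorted(degenerate_peptides(mapping)))
print(protein_groups(mapping))
```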

This analysis provides evidence that protein inference 
is a non-trivial problem, especially for multicellular 
eukaryotes that are known to contain large numbers of 
paralogous proteins. It also emphasizes the importance 
of developing sophisticated protein inference algorithms. 

Rule-based approaches 

With a typical LC-MS/MS experiment resulting in a 
potentially large number of protein identifications, con- 
cerns were raised regarding the impact of misidentified 
proteins on biomedical science [45]. In response to this, 
several guidelines were proposed regarding the standards 
for publishing proteomics results [46-49]. The so-called 
"two-peptide rule" or two-hit rule, requiring two or more 
confidently identified peptides to define a confident pro- 
tein identification, was advocated [46,48]. The same 
guidelines also recommended the parsimony principle 
(see next Section) as an explanation for the confident 
peptide identifications, and suggested that a "protein 
family" - proteins with similar sequences due to single 
amino acid variants, homologs, splicing variants, or anno- 
tation mistakes - should be reported as one group if the 
proteins share the same identified peptides. 

There is a good rationale for using the two-peptide 
rule. In principle, one correct unique peptide should be 
sufficient to correctly identify a protein. However, even 
for the low FDR associated with a set of peptides, many 
individual peptides in a large data set are incorrectly 
identified. Furthermore, proteins identified by single pep- 
tide hits are more likely to be incorrectly identified than 
proteins with higher peptide coverage [45]. It was 
reported that FDRs for single-hit proteins can be over 10 
times higher than FDRs at the PSM level [50], likely due 
to the clustering of correct peptide identifications to the 
correct proteins and the lack of clustering behavior for 
the incorrect peptides [50,51]. 

However, the two-peptide rule has been challenged 
[51,52]. First, while including single-hit proteins without 
stringent quality control can compromise specificity, 
ignoring such proteins will certainly compromise sensitivity 
[52]. Second, controlling the confidence (FDR) at 
the peptide level and then deducing the proteins using 
heuristic rules leads to undefined FDRs at the protein 
level [27,50-52]. On the other hand, controlling FDR 
directly at the protein level may rescue some of the confident 
single-hit proteins. Indeed, Gupta and Pevzner 
demonstrated that using the "single-peptide rule" results 
in 10-40% more protein identifications compared with 
the two-peptide rule at a fixed FDR level [52]. The 
single-peptide rule simply uses the highest scoring peptide 
from a protein as a score for that protein, and then 
directly estimates FDR at the protein level (rather than 
at the peptide level) using decoy databases. Thus, any 
protein that has one or more peptides with a score 
above a certain threshold is deemed confident. This 
may seem problematic because proteins hit by 
single peptides are generally considered less reliable. However, two 
mediocre peptides are not necessarily better than one 
good peptide; thus, many proteins hit by a single peptide 
can be rescued with more stringent score thresholds. 
Since a significant portion of such proteins are correct 
[53], it is not surprising that the single-peptide rule 
leads to more protein identifications. 

Table 1 Summary of notation and abbreviations used throughout this paper. 

Notation | Description
$\mathcal{S}$ | Set of all fragmentation spectra outputted by mass spectrometer
$\mathcal{S}_j$ | Set of spectra identified for peptide $j$
$s$ | A single fragmentation spectrum, $s \in \mathcal{S}$
$P_i$ or $i$ | Protein $i$
$p_j$ or $j$ | Peptide $j$
$p_{ij}$ | Peptide $j$ derived from protein $i$; used to explicitly indicate the parent protein for peptide $j$
$\mathcal{P} = \{P_1, P_2, \ldots\}$ | Protein database, a set of proteins used for peptide and protein identification
$\mathbf{p} = \{p_1, p_2, \ldots\}$ | Peptide database, the set of all (tryptic) peptides derived from $\mathcal{P}$
$\mathbf{p}_i$ | Set of peptides derived from protein $P_i$
$t_j$ | Indicator variable, set to 1 if peptide $p_j$ is confidently identified
$\mathcal{C} = \{p_j \mid t_j = 1\}$ | Set of peptides that are confidently identified
$x_j$ | Indicator variable, set to 1 if $p_j \in \mathbf{p}$ is present in the sample
$y_i$ | Indicator variable, set to 1 if $P_i \in \mathcal{P}$ is present in the sample
$\mathbf{x} = (x_1, \ldots, x_j, \ldots)$ | Indicator vector representing all peptides in $\mathbf{p}$
$\mathbf{y} = (y_1, \ldots, y_i, \ldots)$ | Indicator vector representing all proteins in $\mathcal{P}$
$N(i)$ | Set of peptides mapped to protein $P_i$
$N(j)$ | Set of proteins that contain peptide $p_j$
 | Indicator vector representing peptides in $\mathbf{p}_i$
$P(x_j = 1 \mid \mathcal{S}_j)$ | Peptide identification probability, the probability that peptide $j$ is present in the sample given the spectra identified for peptide $j$
$P(x_j = 1 \mid s)$ | The probability that the PSM is correct when peptide $j$ is the top-scoring match of spectrum $s$
$P(y_i = 1 \mid \mathcal{S})$ | Protein posterior probabilities, the probability that protein $i$ is present in the sample given all spectra
$d_{ij}(q)$ | Detectability of peptide $p_{ij}$ at some specified quantity $q$; effective detectability
$d_{ij}^0$ | Detectability of peptide $p_{ij}$ at standard quantity $q^0$; standard detectability
$d_{ij}$ | Detectability of peptide $p_{ij}$, effective detectability
$NSP_{ij}$ | The estimated number of (identified) sibling peptides of peptide $p_{ij}$, used by ProteinProphet to adjust the peptide identification probability
PSM | Peptide-spectrum match; when it is clear from the context, we use PSM to also refer to the top-scoring PSM per spectrum
FDR | False discovery rate; the fraction of incorrect peptide identifications in $\mathcal{C}$, or the fraction of incorrect protein identifications in a given list outputted by a protein inference algorithm. FDR should be distinguished from the false positive rate (FPR), the fraction of all peptides (proteins) from the database that are not present in the sample but are predicted to be present (at a particular threshold).

With the help of protein-level FDR estimation (using a 
decoy database), better and more complex rules may be 
devised to achieve even higher sensitivity. For example, 
Weatherly et al. proposed setting separate score thresholds 
for proteins with different numbers of confident 
peptide identifications [51]. They reported that gradually 
lower score thresholds were needed for proteins with 
increasingly higher coverage. For a coverage of 1 (i.e. 
proteins hit by single peptides), a MASCOT score of 44 
was required, while for a coverage of 6, a MASCOT score 
as low as 11 was sufficient for the same FDR [51]. 

Figure 1 

Despite the relative simplicity of rule-based approaches, 
the performance of heuristic rules is fundamentally limited 
by the lack of rigorous treatment and proper combination 
of the peptide identification scores and prior knowledge. 
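
The two-peptide and single-peptide rules discussed in this section can be contrasted in a few lines of Python. This is a minimal sketch under stated assumptions: each protein is represented by the list of scores of its peptides, decoy proteins are marked by a hypothetical `DECOY_` name prefix, and all function names and scores are invented for illustration rather than taken from the cited studies.

```python
def two_peptide_rule(protein_to_scores, peptide_threshold):
    """Accept proteins with at least two peptides at or above the threshold."""
    return {prot for prot, scores in protein_to_scores.items()
            if sum(1 for s in scores if s >= peptide_threshold) >= 2}


def single_peptide_rule(protein_to_scores, protein_threshold):
    """Score each protein by its best peptide and accept it above the threshold."""
    return {prot for prot, scores in protein_to_scores.items()
            if scores and max(scores) >= protein_threshold}


def protein_level_fdr(protein_to_scores, protein_threshold, decoy_prefix="DECOY_"):
    """Decoy-based FDR estimate for the single-peptide rule.

    The decoy naming convention is an assumption of this sketch.
    """
    accepted = single_peptide_rule(protein_to_scores, protein_threshold)
    decoys = sum(1 for prot in accepted if prot.startswith(decoy_prefix))
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


# Toy scores: P1 has one strong peptide, P2 has two mediocre peptides.
scores = {"P1": [60], "P2": [30, 28], "P3": [25], "DECOY_P4": [27]}
print(two_peptide_rule(scores, 26))
print(single_peptide_rule(scores, 50))
print(protein_level_fdr(scores, 26))
```

In this toy input the two rules accept different proteins: the two-peptide rule keeps P2 and discards the single strong hit P1, whereas the single-peptide rule with a stricter threshold does the opposite.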

Combinatorial optimization algorithms 

The input to this class of algorithms typically consists of 
a set of confidently identified peptides $\mathcal{C} = \{p_j \mid t_j = 1\}$ 
and a protein database $\mathcal{P}$. The objective is to provide a 
list of proteins that optimizes certain criteria. In one 
way or another, all such formulations result in NP-hard 
problems and are usually solved using approximation 
algorithms. 

The minimum set cover formulation 

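Minimum set cover (MSC) problem: Given a set of 
confident peptide identifications $\mathcal{C}$ and protein database 
$\mathcal{P}$, find a smallest protein list $\mathcal{L} \subseteq \mathcal{P}$ such that each 
peptide from $\mathcal{C}$ is assigned to at least one protein from $\mathcal{L}$. 
More formally, 

$$\begin{aligned} \text{minimize} \quad & \sum_i y_i \\ \text{subject to} \quad & \sum_{i:\, p_j \in \mathbf{p}_i} y_i \ge 1 \quad (\forall p_j \in \mathcal{C}). \end{aligned}$$
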

This protein inference formulation is identical to the 
classical computer science problem of minimum set 
cover, where given a set of elements (peptides) $U$ and a 
set of subsets (proteins) over $U$, the goal is to find a 
smallest (not necessarily unique) set of subsets that contain 
all elements in $U$. It is convenient to visualize the 
MSC formulation using bipartite graphs (Figure 2A). 
Using graph representation, it is relatively easy to see 
that an optimal solution to the MSC problem can also 
be provided if the original graph is divided into connected 
components and an optimal MSC solution provided 
separately for each component. 
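
Because the problem is NP-hard, a greedy heuristic is a common way to obtain an approximate set cover: repeatedly pick the protein that explains the most still-uncovered peptides. The Python sketch below illustrates this textbook heuristic. It is not the algorithm of any particular tool cited here, and the function name and toy data are hypothetical.

```python
def greedy_set_cover(protein_to_peptides, confident_peptides):
    """Greedy approximation to the minimum set cover of confident peptides.

    Ties are broken by protein name so that the result is deterministic;
    this tie-breaking is arbitrary and has no biological meaning.
    """
    uncovered = set(confident_peptides)
    selected = []
    while uncovered:
        # Pick the protein covering the largest number of uncovered peptides.
        best = max(sorted(protein_to_peptides),
                   key=lambda prot: len(protein_to_peptides[prot] & uncovered))
        gained = protein_to_peptides[best] & uncovered
        if not gained:
            raise ValueError("some confident peptides map to no protein")
        selected.append(best)
        uncovered -= gained
    return selected


# Toy instance: peptide "d" can be explained equally well by P2 or P3.
proteins = {
    "P1": {"a", "b", "c"},
    "P2": {"c", "d"},
    "P3": {"d"},
    "P4": {"a"},
}
print(greedy_set_cover(proteins, {"a", "b", "c", "d"}))
```

The tie between P2 and P3 in the toy instance shows why set-cover solutions need not be unique: both two-protein lists cover all four peptides.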

The MSC approach has been implemented in the 
IDPicker software [54,55]. IDPicker, however, also con- 
tains several heuristics that further simplify the solution 
and its interpretation. The algorithm starts by collapsing 
the peptide-protein bipartite graph such that all pep- 
tides/proteins connected to the same proteins/peptides 
form group nodes containing multiple peptides or pro- 
teins. It then finds a set of disconnected subgraphs 
within a bipartite graph using a depth-first search. 
Finally, it performs an MSC optimization in each of those 
subgraphs. IDPicker extends beyond algorithmic imple- 
mentations, e.g. it contains modules for calculating con- 
fidently identified peptides (using an FDR-based 
approach), modules for combining scores from multiple 
search engines, as well as visualization of results. 
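
The decomposition step can be illustrated with an iterative depth-first search over the peptide-protein bipartite graph. The following Python sketch is a generic illustration and not IDPicker code; the function name, node labelling, and toy graph are assumptions made for the example.

```python
def connected_components(protein_to_peptides):
    """Split a peptide-protein bipartite graph into connected components.

    Returns a sorted list of components, each given as a sorted list of
    protein names.
    """
    # Build an undirected adjacency list; nodes are tagged by type so that
    # a protein and a peptide with the same name cannot collide.
    adjacency = {}
    for prot, peps in protein_to_peptides.items():
        adjacency.setdefault(("protein", prot), set())
        for pep in peps:
            adjacency[("protein", prot)].add(("peptide", pep))
            adjacency.setdefault(("peptide", pep), set()).add(("protein", prot))

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited node.
        stack = [start]
        seen.add(start)
        proteins = set()
        while stack:
            node = stack.pop()
            kind, name = node
            if kind == "protein":
                proteins.add(name)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(proteins))
    return sorted(components)


# P1 and P2 share peptide "b"; P3 is isolated from them.
graph = {"P1": {"a", "b"}, "P2": {"b"}, "P3": {"c"}}
print(connected_components(graph))
```

Each component can then be optimized separately, which keeps the individual set cover instances small.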

The minimum set cover formulation is one of the most 
commonly encountered strategies in protein inference, 
and is recommended by the guidelines for publishing 
proteomics results [46,48]. Its intuition is to select the 
smallest among many possible solutions (Occam's razor, 
parsimony principle), which can be justified by considering 
the number of possible solutions when the protein list 
consists of exactly $n$ proteins. Assuming $n \ll |\mathcal{P}|$, the 
solutions of smaller sizes are selected from a smaller 
solution space and are therefore less likely to be spurious 
findings. In many practical situations, including protein 
inference, the MSC formulation leads to natural and 
acceptable solutions. However, it is not obvious that a 
minimalist formulation should apply to biological 
samples in which multiple paralogous proteins or protein 
isoforms may be present at the same time. This approach 
also ignores other available information, e.g. peptides 
that are not identified (all dashed edges in Figure 2B), 
gene functions [56] or mRNA expression levels [57]. 

The partial set cover formulation 

Although the MSC formulation relies on a set of confidently 
identified peptides, a subset of such peptides is 
expected to be incorrect identifications. This fact provides 
motivation for the partial set cover approaches 
where the goal is to find the minimum protein list that 
covers at least $100 \cdot c$% of the identified peptides, where 
$0 < c < 1$ is a user-specified parameter. 

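Minimum partial set cover (MPSC) problem: Given 
a set of confident peptide identifications $\mathcal{C}$, protein database 
$\mathcal{P}$ and parameter $c$ ($0 < c < 1$), find a protein list $\mathcal{L}$ 
of minimal size such that at least $100 \cdot c$% of identified 
peptides are assigned to the proteins from $\mathcal{L}$. More 
formally, 

$$\begin{aligned} \text{minimize} \quad & \sum_i y_i \\ \text{subject to} \quad & \sum_{i:\, p_j \in \mathbf{p}_i} y_i + z_j \ge 1 \quad (\forall p_j \in \mathcal{C}) \\ & \sum_j z_j \le (1 - c) \cdot |\mathcal{C}|, \end{aligned}$$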



where $z_j \in \{0, 1\}$ indicates whether peptide $p_j \in \mathcal{C}$ is 
excluded ($z_j = 1$) from the list of assigned peptides. Both 
MSC and MPSC problems are NP-hard in general. Thus, 
optimal solutions cannot be guaranteed in situations with 
a large number of identified peptides (note that each peptide 
from $\mathcal{C}$ adds a constraint in the problem formulation). 
A number of approximation algorithms have been 
proposed ranging from greedy algorithms to integer 
programming, and several such algorithms have been 
tested in protein inference [58]. 

Figure 2 (panels A and B) 
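
A greedy heuristic for the partial cover differs from the full set cover only in its stopping rule: selection ends once at least a fraction $c$ of the confident peptides is covered. The Python sketch below shows that modification. It is a generic illustration, the function name and toy data are hypothetical, and it is not one of the algorithms evaluated in [58].

```python
import math


def greedy_partial_cover(protein_to_peptides, confident_peptides, c):
    """Greedy heuristic covering at least a fraction c of confident peptides."""
    confident = set(confident_peptides)
    # At most (1 - c) * |C| peptides may stay uncovered, so at least
    # ceil(c * |C|) peptides must be covered.
    required = math.ceil(c * len(confident))
    covered = set()
    selected = []
    while len(covered) < required:
        # Pick the protein adding the most not-yet-covered confident peptides;
        # ties are broken by protein name (an arbitrary choice).
        best = max(sorted(protein_to_peptides),
                   key=lambda prot: len((protein_to_peptides[prot] & confident) - covered))
        gained = (protein_to_peptides[best] & confident) - covered
        if not gained:
            raise ValueError("requested coverage cannot be reached")
        selected.append(best)
        covered |= gained
    return selected


proteins = {
    "P1": {"a", "b", "c"},
    "P2": {"c", "d"},
    "P3": {"d"},
    "P4": {"a"},
}
# Covering 75% of four peptides needs one protein; covering all needs two.
print(greedy_partial_cover(proteins, {"a", "b", "c", "d"}, 0.75))
print(greedy_partial_cover(proteins, {"a", "b", "c", "d"}, 1.0))
```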

Both the MSC and MPSC problem formulations result 
in situations where it is not possible to distinguish among 
proteins identified exclusively by degenerate peptides (e.g. 
proteins $P_1$ and $P_2$ in Figure 2). Nesvizhskii and Aebersold 
have identified several such classes of proteins, naming 
them indistinguishable proteins, subset proteins, subsumable 
proteins, etc. [20]. Because such situations are common 
for eukaryotes or samples containing multiple closely 
related organisms, different problem formulations are 
necessary to provide appropriate tie resolutions. 

The minimum missed peptide formulation 

The MSC-based formulations of the protein inference 
problem rely only on peptides that were confidently 
identified ($\mathcal{C}$) and thus ignore all unidentified peptides 
from the proteins containing at least one peptide from 
$\mathcal{C}$, see dashed edges in Figure 2B. In addition, these 
methods implicitly assume that each peptide is equally 
likely to be observed in an MS/MS experiment. The first 
combinatorial approach addressing these aspects was the 
minimum missed peptide (MMP) formulation [59]. This 
approach relies on the concept of peptide detectability 
(Box 1). 

To provide intuition for the MMP approach, let us 
consider the example in Figure 3, which itself corresponds 
to the bipartite graph from Figure 2B. When considering 
only peptides in $\mathcal{C}$ (solid lines in Figure 2B), 
proteins $P_1$ and $P_2$ would be classified as indistinguishable 
[20]; however, given detectabilities of all peptides, it can 
be inferred that protein $P_1$ is more likely to be present in 
the sample than protein $P_2$. Specifically, the three identified 
peptides (shaded) are the most detectable peptides in 
protein $P_1$. On the other hand, these peptides are among 
the least expected peptides to be observed if protein $P_2$ 
were in the sample. Thus, protein $P_1$ is more likely to be a 
correct identification than protein $P_2$. Note that the tie 
resolution was provided by considering unidentified 
peptides. 

Figure 3 (panels: protein $P_1$ and protein $P_2$) 

Before formalizing the MMP approach, let us consider 
a particular solution to the protein inference problem in 
which different peptides from $\mathcal{C}$ are assigned to protein 
$P_i$. Note that some peptides $p_{ij} \in \mathcal{C}$ may not be 
assigned to $P_i$ ($x_{ij} = 0$) although their sequence can be 
mapped to the protein and the peptide is confidently 
identified ($t_j = 1$). Peptide $p_{ij}$ is defined as missed if 
$p_{ij} \notin \mathcal{C}$ and 

$$d_{ij} > \min_k \{ d_{ik} \mid p_{ik} \in \mathcal{C} \text{ and } x_{ik} = 1 \},$$

where $d_{ij}$ is the detectability of peptide $p_{ij}$. In other words, 
a peptide is missed if in a particular inference solution 
(1) it is not confidently identified and (2) a peptide with 
lower detectability from the same protein is identified 
and assigned to that protein. We emphasize that the 
peptides with detectabilities lower than the minimum 
detectability of assigned peptides for protein $P_i$ are not 
considered missed due to the fact that protein quantity 
influences effective detectability of all peptides in $P_i$. 
Thus, for effective detectability below a certain threshold, 
no peptides are expected to be observed. The MMP 
approach can now be formalized as follows. 
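
The definition of a missed peptide translates directly into a small function for a single protein. The Python sketch below is an illustration only: the function name is hypothetical and the detectability values are invented, loosely mirroring a situation in which the identified peptides are the most detectable ones of one protein but the least detectable ones of another.

```python
def missed_peptides(detectability, confident, assigned):
    """Return the missed peptides of one protein.

    detectability: dict mapping each peptide of the protein to its detectability
    confident:     set of confidently identified peptides
    assigned:      set of peptides assigned to this protein in the solution
    """
    assigned_here = [pep for pep in assigned
                     if pep in detectability and pep in confident]
    if not assigned_here:
        return set()
    # Minimum detectability among peptides identified and assigned here.
    floor = min(detectability[pep] for pep in assigned_here)
    # Missed: not confidently identified, yet more detectable than the floor.
    return {pep for pep, d in detectability.items()
            if pep not in confident and d > floor}


identified = {"a", "b", "c"}
# For P1 the identified peptides are its most detectable ones.
p1 = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.3, "e": 0.1}
# For P2 the same peptides are its least detectable ones.
p2 = {"a": 0.2, "b": 0.15, "c": 0.1, "f": 0.9, "g": 0.8, "h": 0.6}
print(sorted(missed_peptides(p1, identified, identified)))
print(sorted(missed_peptides(p2, identified, identified)))
```

Assigning the three identified peptides to P1 produces no missed peptides, whereas assigning them to P2 produces three, so a criterion that minimizes missed peptides prefers P1.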

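Minimum missed peptide (MMP) problem: Given a 
set of confident peptide identifications $\mathcal{C}$, protein database 
$\mathcal{P}$ and peptide detectability for each peptide 
$p_j \in \mathbf{p}$, find a set of proteins $\mathcal{L} \subseteq \mathcal{P}$ that covers all peptides 
in $\mathcal{C}$ and minimizes the number of missed peptides. 
More formally, 

$$\begin{aligned} \text{minimize} \quad & \sum_{i,j} z_{ij} \cdot (1 - t_j) \\ \text{subject to} \quad & (z_{ij} - z_{ik}) \cdot (d_{ij} - d_{ik}) \ge 0 \quad (\forall i,\ j \in N(i),\ k \in N(i)) \\ & \sum_{i \in N(j)} z_{ij} \ge 1 \quad (\forall p_j \in \mathcal{C}), \end{aligned}$$

where $z_{ij} \in \{0, 1\}$ indicates whether detectability $d_{ij}$ for 
peptide $p_{ij} \in \mathbf{p}_i$ is above or equal to ($z_{ij} = 1$) or below 
($z_{ij} = 0$) the minimum detectability of peptides assigned 
to protein $P_i$ and $N(i)$ is a set of peptides connected to 
$P_i$ in the expanded bipartite graph (see Figure 2B). A set 
of identified proteins can now be determined as 

$$\mathcal{L} = \Big\{ P_i : \sum_{j \in N(i)} z_{ij} \cdot t_j > 0 \Big\}.$$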



Alves et al. have shown that the minimum set cover 
problem can be reduced to the minimum missed peptide 
formulation [59]. Thus, the MMP problem is NP-hard 
and approximation algorithms are needed for large-scale 
problems. Alves et al. proposed an efficient greedy 
approximation algorithm that provides a good solution 
[59-61]. Alternative formulations and algorithmic 
approaches are also possible. For example, this algorithm 
can be generalized in a relatively straightforward manner 
to a partial set cover formulation or to a version that minimizes 
the overall probability of unidentified peptides. 

Although the MMP formulation was the first protein 
inference technique capable of resolving indistinguishable 
proteins, it generally shares the limitations of other 
approaches based on combinatorial optimization techni- 
ques. That is, these algorithms do not provide probabilities 
for identified proteins, unless post-processing statistical 
models are used [62]. 

Probabilistic inference algorithms 

Similarly to the previous classes of algorithms, probabil- 
istic approaches to protein inference generally consist of 
two steps. First, PSM scores are converted to PSM prob- 
abilities using algorithms such as PeptideProphet [38]. 
After this pre-processing step, protein inference is per- 
formed based on an assumed probabilistic model. In 
probabilistic terms, protein inference involves computing 
protein posterior probabilities $P(y_i = 1 \mid \mathcal{S})$ for every 
protein in $\mathcal{P}$. 

Several classes of probabilistic algorithms have been 
proposed so far [21,24,60,61,63-71], with different strate- 
gies and levels of rigor in addressing protein groups and 
different run-time performance. Some probabilistic algo- 
rithms do not address degenerate peptides [63,65,68,70], 
while some such as ProteinProphet [21] combine prob- 
abilistic inference with the parsimony principle (for 
degenerate peptides) and protein grouping (for indistin- 
guishable proteins). In the following subsections, we 
provide an in-depth discussion of the three major prob- 
abilistic methods: ProteinProphet [21], MSBayesPro [61], 
and Fido [71], and briefly mention several other meth- 
ods. We use the same notation for all models and, when 
possible, provide new interpretations of the algorithms. 
We aim to reveal inherent connections and principal 
differences among the methods. For original derivations 
and interpretations, readers are referred to the original 
publications. 

ProteinProphet 

ProteinProphet is the first and most widely used probabil- 
istic protein inference approach [21], with importance 
comparable to the first automated peptide identification 
tool, SEQUEST [9]. ProteinProphet consists of four major 
steps; together, they convert the original PSM probabilities 
from PeptideProphet to peptide identification probabilities 
and then combine the peptide identification probabilities 
to infer proteins. 

Pre-processing. In order to obtain protein identification 
probabilities, peptide identification probabilities are 
needed as input. Here, the difficulty is to obtain one 
peptide identification probability from typically multiple 
spectra matched to a peptide. The solution used in 
ProteinProphet is to simply take the maximum value 
among the peptide-spectrum matching probabilities for 
peptide $j$ (step 1, Figure 4A), i.e. 

$$P(x_j = 1 \mid \mathcal{S}_j) = \max_{s \in \mathcal{S}_j} P(x_j = 1 \mid s),$$

where $\mathcal{S}_j$ is the set of spectra identified for peptide $j$. 
If no spectrum is matched to the peptide, i.e. if $\mathcal{S}_j = \emptyset$, 
then $P(x_j = 1 \mid \mathcal{S}_j) = 0$. Recently, the iProphet algorithm 
was proposed to improve this approach [72]. 

Combining peptide probabilities A key feature of
ProteinProphet is that protein probabilities are computed
by assuming peptide identifications to be independent
pieces of evidence for the presence of protein i in the
sample, i.e.

$P(y_i = 1 \mid S) = 1 - \prod_{j \in N(i)} \left(1 - P(x_j = 1 \mid S_j)\right),$

where N(i) is the set of peptides mapped to protein i.
This assumption, however, is not easy to justify because 
peptide identifications are not statistically independent. 
That is, if one peptide from the protein is confidently 
identified, the chance is higher that another peptide 
from the same protein will also be identified. Another 
problem with this assumption is that each degenerate 
peptide is counted toward all proteins it maps to. These 
issues are addressed via the following two adjustment 
steps. 
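
As a minimal sketch of the pre-processing and combination steps described above (not ProteinProphet's actual implementation), the following Python snippet takes the maximum PSM probability for each peptide and then combines the peptide probabilities of one protein under the independence assumption. The function names and the toy probabilities are illustrative assumptions.

```python
from math import prod


def peptide_probability(psm_probs):
    """Maximum PSM probability for one peptide; 0.0 if no spectrum matched."""
    return max(psm_probs, default=0.0)


def protein_probability(peptide_probs):
    """1 - prod_j (1 - P(x_j = 1 | S_j)) over the peptides of one protein."""
    return 1.0 - prod(1.0 - p for p in peptide_probs)


# Hypothetical PSM probabilities for three peptides of a single protein.
psms = {"pep1": [0.9, 0.4], "pep2": [0.5], "pep3": []}
pep = {j: peptide_probability(s) for j, s in psms.items()}
prot = protein_probability(pep.values())
print(pep, prot)
```

Because every additional peptide can only raise the product-based score, many low-probability peptides can add up to a confident protein, which is one reason the adjustment steps are needed.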

Adjustment for peptide identification probability To
address the limitation due to the independence assumption,
ProteinProphet replaces $P(x_j = 1 \mid S_j)$ in the equation
above by $P(x_j = 1 \mid S)$ (step 2, Figure 4A). The
difference between the adjusted peptide identification
probability $P(x_j = 1 \mid S)$ and the original peptide
identification probability $P(x_j = 1 \mid S_j)$ comes from the presence
of other spectra (peptides) mapped to the same protein as
peptide j. They are expected to change the confidence of
peptide identification. However, it is not straightforward
to estimate $P(x_j = 1 \mid S)$. Nesvizhskii et al. defined the
expected number of sibling peptides (NSP), i.e. the number
of identified peptides (other than peptide $p_j$) from the same
protein, weighted by the adjusted peptide identification
probability $P(x_{j'} = 1 \mid S)$. Specifically,

$NSP_{ij} = \sum_{j' \in N(i),\, j' \neq j} P(x_{j'} = 1 \mid S),$

where i indexes a parent protein of peptide j (step 4, Figure 4A).
ProteinProphet then approximates
$P(x_j = 1 \mid S) \approx P(x_j = 1 \mid S_j, NSP_{ij})$,
which is computed from $P(x_j = 1 \mid S_j)$ and $P(NSP_{ij} \mid x_j)$ by
using the Bayes rule. Since computing $NSP_{ij}$ requires
$P(x_{j'} = 1 \mid S)$, and computing $P(x_j = 1 \mid S)$ requires
$NSP_{ij}$, iterative updating is used until convergence (steps
2, 4; Figure 4A).

Adjustment for peptide degeneracy In order to address
degenerate peptides, a weighting scheme is used to
modify protein probabilities to

$P(y_i = 1 \mid S) = 1 - \prod_{j \in N(i)} \left(1 - w_{ij} \cdot P(x_j = 1 \mid S)\right),$

where $w_{ij}$ is the "proportion" of peptide j assigned to
protein i (step 3, Figure 4A). Nesvizhskii et al. defined

$w_{ij} = P(y_i = 1 \mid S) \Big/ \sum_{i' \in N(j)} P(y_{i'} = 1 \mid S),$

where N(j) is the set of proteins that contain peptide j
(step 5, Figure 4A). This adjustment step is in accordance
with the parsimony principle because $\sum_{i \in N(j)} w_{ij} = 1$,
i.e. one peptide is ensured to come from only one protein in
total. Note that $w_{ij} = 1$ for all unique peptides and that
$w_{ij} = 0$ if peptide j cannot be mapped to protein i, i.e.
when $i \notin N(j)$. Since the calculations of $w_{ij}$ and
$P(y_i = 1 \mid S)$ are mutually dependent, another iterative
updating procedure is used until convergence.
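
The mutual dependence between the weights and the protein probabilities can be sketched as a simple fixed-point iteration. The Python code below is a simplified illustration under stated assumptions: it omits the NSP adjustment and any minimum-probability filtering, starts from equal weights (an arbitrary choice), and uses hypothetical names and toy inputs. It is not the ProteinProphet implementation.

```python
from math import prod


def iterate_protein_probabilities(pep_prob, parents, n_iter=2000, tol=1e-9):
    """pep_prob[j] = P(x_j = 1 | S); parents[j] = proteins containing peptide j."""
    proteins = sorted({i for ps in parents.values() for i in ps})
    # Start by splitting every peptide equally among its parent proteins.
    w = {(i, j): 1.0 / len(ps) for j, ps in parents.items() for i in ps}
    prob = {}
    for _ in range(n_iter):
        # Protein update: 1 - prod_j (1 - w_ij * P(x_j = 1 | S)).
        new_prob = {
            i: 1.0 - prod(1.0 - w[(i, j)] * pep_prob[j]
                          for j, ps in parents.items() if i in ps)
            for i in proteins
        }
        # Weight update: w_ij = P(y_i = 1 | S) / sum over parents of peptide j.
        for j, ps in parents.items():
            total = sum(new_prob[i] for i in ps)
            for i in ps:
                w[(i, j)] = new_prob[i] / total if total > 0 else 1.0 / len(ps)
        converged = bool(prob) and max(
            abs(new_prob[i] - prob[i]) for i in proteins) < tol
        prob = new_prob
        if converged:
            break
    return prob, w


# Toy case: p1 is shared by proteins A and B; p2 is a weak peptide unique to A.
pep_prob = {"p1": 0.9, "p2": 0.3}
parents = {"p1": ["A", "B"], "p2": ["A"]}
prob, w = iterate_protein_probabilities(pep_prob, parents)
print(prob, w)
```

In this toy run the weights of the shared peptide shift steadily toward protein A, so the protein without a unique peptide is driven toward probability 0 even though the unique peptide of A is only weakly supported.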

By combining these four steps, with a minor modification
to include weights $w_{ij'}$ for peptides in the NSP
adjustment step, i.e.
$NSP_{ij} = \sum_{j' \in N(i),\, j' \neq j} w_{ij'} \cdot P(x_{j'} = 1 \mid S)$,
the protein identification probability $P(y_i = 1 \mid S)$ can be
approximated through a variant of the expectation-maximization
(EM) iterative process (steps 2-5; Figure 4A).
Since indistinguishable proteins remain indistinguishable
in ProteinProphet, the grouping strategy is adopted by
treating the indistinguishable proteins as one protein.
Therefore, a "group probability", i.e. the probability that
any one of the proteins in the group is identified, is
reported.

As the first probabilistic inference method for protein 
identification, ProteinProphet has been very successful 
and, as part of the Trans-Proteomic Pipeline [73], 
remains the most widely used protein inference tool. 
Although the degenerate peptides are handled by a
parsimony-driven weighting procedure, ProteinProphet obtains
those weights iteratively, which
ultimately results in reasonable probabilities for proteins.
Recently, the tool has been improved, mainly at the pre- 
processing step, due to iProphet [72]. By using the same 
computational strategy as in the NSP adjustment step of 
ProteinProphet, iProphet obtains one identification prob- 
ability for each peptide by aggregating the PSM probabil- 
ities of the peptide from multiple search engines, spectra, 
experiments, charge states, and PTM states. 
Limitations Because ProteinProphet relies on certain 
strong assumptions, e.g. the parsimony-driven weighting 
(step 5, Figure 4A), its outputs are not always sensible 
from a statistical perspective. One such scenario was 
noticed by the authors [21], that for a set of proteins with 
shared peptides, a protein with a unique peptide, no mat- 
ter how small the identification probability is, always dom- 
inates the protein(s) without unique peptides. In other 
words, the algorithm assigns score 1 to the protein with a 
random but unique peptide identification and score 0 to 
other proteins. This is undesirable, since there are always 
a large number of random peptide identifications with 
close to 0 probabilities in real proteomics data sets.
To address the issue, only peptides with probabilities >0.2
are used to compute protein probabilities. Similarly, we
observed that the inference outcome of ProteinProphet is
sensitive to minor changes in peptide probabilities. This
can be illustrated by a simple example shown in Figure 4B.
Consider two homologous proteins $P_1$ and $P_2$ with identified
peptides {$p_1$, $p_2$, $p_3$} and {$p_1$, $p_2$, $p_4$}, respectively.
Suppose $p_1$ and $p_2$ are reliable identifications, but that $p_3$ and
$p_4$ are not, with small identification probabilities. In the
seven toy data sets (A-G) in Figure 4B, we varied the
identification probability of peptides $p_3$ and $p_4$, and computed
the protein probability using ProteinProphet. In data sets
A and E, when the probabilities of the unique peptides are not
larger than 0.5, ProteinProphet considers proteins $P_1$ and
$P_2$ indistinguishable, and only reports a group probability;
in data set B, when the probability of peptide $p_3$ is slightly
larger than that of $p_4$ (which has probability 0.5 or less),
ProteinProphet considers protein $P_1$ as much more reliable than $P_2$;
in data sets C and G, when the probability of peptide $p_4$ is
(slightly) larger than that of $p_3$ (which has probability 0.5 or less),
ProteinProphet considers protein $P_2$ as much more reliable
than $P_1$; in data set D, when the probabilities are both
larger than 0.5, ProteinProphet considers both proteins to
be reliable; while in data set F, when the probability of
peptide $p_3$ is 0.2 or less, ProteinProphet suggests that only
protein $P_2$ can be the true protein, despite the significant
probability that peptide $p_4$ is a random identification. This
non-continuity of the inference results is counterintuitive.
Naturally, one would expect the probability of protein $P_2$
($P_1$) to decrease (increase) gradually as the probability of
peptide $p_4$ decreases.

Although ProteinProphet applies the parsimony principle
to the issue of shared peptides, it uses a probabilistic
model and an EM-like algorithm. Thus, ProteinProphet
distinguishes itself from the other parsimony principle-driven
methods, such as the combinatorial approaches discussed
earlier. However, it is not clear how often ProteinProphet
actually leads to the same solutions as the
various combinatorial approaches regarding proteins with
shared peptides. In addition, in the presence of degenerate
peptides, the inference problem is difficult; thus, it
would be interesting to compare the EM-like iterative
algorithm used by ProteinProphet with the heuristics used
by the combinatorial approaches to examine how efficiently
they handle large data sets.
MSBayesPro 

MSBayesPro [61] is defined as a full probabilistic protein 
inference method which provides "perhaps the most rig- 
orous existing treatment of the peptide degeneracy pro- 
blem" [71]. The MSBayesPro model includes peptide 
detectability in the probabilistic model; thus it can, to 
some degree, distinguish among "indistinguishable" 
proteins. 

Model structure MSBayesPro is a Bayesian network
(Figure 5) serving as a generative model for the data.
The high level structure of the network is simple:
Proteins → Peptides → Spectra, which mimics the experimental
protocol in proteomics where proteins are first
digested into peptides, from which spectra are generated.
Hence,

$P(y, x, S) = P(y) P(x \mid y) P(S \mid x) \propto P(y) P(x \mid y) P(x \mid S),$

where y is a vector of random indicator variables for all
candidate proteins, x is a vector of random indicator variables
representing all peptides from those proteins, and
S represents the data, i.e. all the spectra generated in the
experiment. The Peptides → Spectra associations are
defined by the available PSM scores (or probabilities).
The Proteins → Peptides connections, however, are
determined by the sequences of the peptides and
candidate proteins. If the sequence of peptide $p_j$ can be
exactly mapped to protein $P_i$, there will be an edge pointing
from the protein node i to peptide node j in the network.
This is similar to the structure of the model used
in ProteinProphet, although the latter is not a Bayesian 
network. However, there is an important difference 
between MSBayesPro and ProteinProphet, i.e. all pep- 
tides, identified and unidentified, are included in the 
network structure in MSBayesPro. In contrast, the uni- 
dentified peptides are ignored in ProteinProphet and 
other Bayesian network models [69,71] proposed subse- 
quently. Other than the simplification of the model struc- 
ture, we believe there is no legitimate justification for 
excluding unidentified peptides from a probabilistic 
model. Such peptides will have the identification probability
$P(x_j = 1 \mid S_j) = 0$; thus $x_j = 0$ is guaranteed in
the inference step. We note that it is these unidentified 
peptides that, together with the peptide detectability 
information, will lead to tie resolution between grouped 
proteins and improve the scoring of proteins hit by single 
peptides. 

The MSBayesPro model has an important property in 
that the peptide identifications are conditionally indepen- 
dent given the presence of the parent proteins (Figure 5). 
This is not to be confused with the independence assump- 
tion of peptide identification used in ProteinProphet. 
Actually, the conditional independence assumption in 
MSBayesPro will lead to marginally dependent peptide 
identifications if two peptides share parent proteins 
directly or indirectly through other peptide/protein nodes 
(that is, if the two peptides are in a connected component 
of the graph). Furthermore, the conditional independence 
assumption aligns with the LC-MS/MS experiment. Consider
a protein $P_i$ that is in the sample at some known
abundance $q_i$. Then, further knowing that
one peptide is already identified from this protein does not
inform whether another peptide from the same protein
should be identified in MS/MS or not. With conditional
independence, we can expand the joint probabilities of the
set of peptides N(i) (both the identified ones and those
that are not) from protein i as

$P(x_{N(i)} \mid y_i = 1, y_{i' \neq i} = 0, q_i = q) = \prod_{j \in N(i)} P(x_j \mid y_i = 1, y_{i' \neq i} = 0, q_i = q),$

where $q_i$ is the abundance of protein $P_i$.
Model inputs and parameters MSBayesPro requires
peptide identification likelihood ratios and a set of peptide
detectabilities. The former is a required input to the
method. The latter are required parameters of MSBayesPro
that can be provided as an input or, ideally, estimated
via a machine learning model from the same data set on
which protein inference is carried out [24,61].



For peptide identifications, the input to MSBayesPro is
the likelihood ratios $P(S_j \mid x_j = 1)/P(S_j \mid x_j = 0)$
rather than the peptide identification probabilities
$P(x_j \mid S_j)$ that implicitly include a uniform prior [60,61].
Here the original peptide-invariant class priors used to 
compute peptide identification probability are replaced 
in MSBayesPro by the peptide sequence and protein 
abundance dependent detectabilities, which are more 
informative priors. We note that this treatment in 
MSBayesPro is somewhat related to the NSP adjustment 
in ProteinProphet, which essentially changes the prior to 
incorporate information from the NSP values (interest- 
ingly, NSP values may roughly reflect protein abun- 
dances, in similar ways as effective detectability). Note 
that unlike detectability, NSP is not specific to the 
sequence of a peptide. 

Using peptide detectability is an important distin- 
guishing feature of MSBayesPro. Detectability is 
required to build the conditional distribution tables 
between the Protein and Peptide layers and subse- 
quently to compute the posterior probabilities for the 
proteins. However, to use detectability properly it is 
important to consider the impact from protein quantity 
(Box 1). Li et al. [60] proposed a quantity adjustment
formula to convert standard peptide detectability
$d_{ij}^{0} = P(x_j = 1 \mid y_i = 1, q_i = q^{0})$ to effective detectability
$d_{ij}(q) = P(x_j = 1 \mid y_i = 1, q_i = q)$, where $q_i$, the quantity of
protein $P_i$, is estimated by the maximum likelihood or
moment matching approaches. If a (degenerate) peptide
$p_j$ is shared by multiple proteins, the network structure
requires combining detectabilities $d_{ij}$ over all parent
proteins of $p_j$. Here, MSBayesPro assumes that

$d_j = 1 - \prod_{i \in N(j)} P(x_j = 0 \mid y_i = 1, q_i) = 1 - \prod_{i \in N(j)} (1 - d_{ij}).$

Alternative approaches in combining multiple detect- 
abilities may also work, but the key intuition is the fol- 
lowing: if, for a given peptide, there are multiple parent 
proteins all present in the sample, the detectability of 
the peptide should be larger than its detectability from 
any of the individual proteins alone. This treatment per- 
mits a non-parsimonious solution, because a degenerate 
peptide is allowed to come from more than one parent 
protein. 
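
The combination rule above has a noisy-OR form, which the short Python sketch below illustrates. The function name and the detectability values are hypothetical, chosen only to show that the combined detectability is at least as large as any individual one.

```python
from math import prod


def combined_detectability(d_ij, present):
    """d_j = 1 - prod over present parent proteins i of (1 - d_ij).

    d_ij maps each parent protein of peptide j to its effective detectability;
    `present` is the set of parent proteins assumed to be in the sample.
    """
    return 1.0 - prod(1.0 - d for i, d in d_ij.items() if i in present)


d_ij = {"A": 0.5, "B": 0.4}  # hypothetical effective detectabilities
d_one = combined_detectability(d_ij, {"A"})
d_both = combined_detectability(d_ij, {"A", "B"})
print(d_one, d_both)
```

With no parent protein present the product is empty and the detectability is 0, so a peptide cannot be generated by absent proteins in this sketch.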

Inference algorithms With the Bayesian network model
structure and parameters specified, it is in principle easy
to exactly compute the joint posterior probability for
the proteins, i.e. $P(y \mid S) = \sum_x P(y, x \mid S)$. The optimal
solution for the presence of all proteins (the maximum
a posteriori configuration) is computed as
$y_{MAP} = \arg\max_y P(y \mid S)$. The joint posterior probability
can be further marginalized to compute $P(y_i \mid S)$ for the
presence of each individual protein in the sample. In practice,
this is not always possible due to the prohibitive time
complexity, i.e. the inference on Bayesian networks is NP-hard
in general [74]. MSBayesPro uses Gibbs sampling
instead of exact computation when a connected component
in the Bayesian network is large (it is easy to show
that connected components should be considered
separately).
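
For a small connected component, exact inference can be sketched by brute-force enumeration of all protein configurations. The Python code below is an illustrative simplification, not the MSBayesPro implementation: it assumes a common independent prior for each protein, the noisy-OR combination of detectabilities, and likelihood ratios as peptide evidence (0 for an unidentified peptide). Because peptides are conditionally independent given the proteins, the sum over x factorizes into one term per peptide. All names and numbers are hypothetical.

```python
from itertools import product
from math import prod


def protein_posteriors(proteins, det, lr, prior=0.5):
    """Enumerate y to obtain marginals P(y_i = 1 | S) and the MAP configuration.

    det[j][i] = detectability of peptide j from parent protein i;
    lr[j] = P(S_j | x_j = 1) / P(S_j | x_j = 0).
    """
    joint = {}
    for y in product([0, 1], repeat=len(proteins)):
        present = {i for i, yi in zip(proteins, y) if yi}
        p = prod(prior if yi else 1.0 - prior for yi in y)
        for j, d_by_parent in det.items():
            d_j = 1.0 - prod(1.0 - d for i, d in d_by_parent.items()
                             if i in present)
            # Sum over x_j: P(x_j=1|y) * LR_j + P(x_j=0|y) * 1.
            p *= d_j * lr[j] + (1.0 - d_j)
        joint[y] = p
    z = sum(joint.values())
    marginals = {i: sum(p for y, p in joint.items() if y[k]) / z
                 for k, i in enumerate(proteins)}
    y_map = max(joint, key=joint.get)
    return marginals, dict(zip(proteins, y_map))


# A and B share the identified peptide p1; A also has a highly detectable
# peptide p2 that was not identified (likelihood ratio 0).
det = {"p1": {"A": 0.8, "B": 0.8}, "p2": {"A": 0.9}}
lr = {"p1": 20.0, "p2": 0.0}
marginals, y_map = protein_posteriors(["A", "B"], det, lr)
print(marginals, y_map)
```

The enumeration costs 2^n evaluations for n proteins, which is why sampling is needed for large components. In the toy case, the unidentified but highly detectable peptide of A shifts the posterior toward B, illustrating how unidentified peptides can resolve ties.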

It is important to note that MSBayesPro also reports 
estimated protein quantities and the marginal posterior 
probabilities for peptides, which provide better scores for 
measuring peptide confidence [61]. Thus, in its core, 
MSBayesPro is also a label-free quantification algorithm. 
Further generalization of the MSBayesPro model has been 
suggested to unify the peptide and protein identification 
problems and perform higher-level inference on genes and 
pathways based on proteomics data [75]. 
Limitations The use of peptide detectability is both the 
strength and a limitation of MSBayesPro. The method 
requires good detectability predictions in order to achieve 
good performance [24]. However, prediction of detectability
for non-tryptic peptides and post-translationally modified
peptides is not a fully solved problem yet, which limits
the applicability of MSBayesPro. In addition, detectabilities 
cannot be expected to provide tie resolution for proteins 
with nearly identical sequences. These cases, however, 
reveal the limits of shotgun proteomics experiments and 
should be addressed by follow-up experiments such as 
well-designed targeted proteomics experiments. Another 
limitation is related to the computational complexity: effi- 
cient approximation algorithms are necessary for MSBaye- 
sPro to work on very large data sets. 
The Fido model 

The Fido model [71,76] uses a Bayesian network, but was 
primarily designed for fast inference. The major contribution
of this method consists of two graph transformations
applied to each connected component: collapsing
protein nodes that are connected to the identical sets of
peptides and pruning of spectral nodes (with user speci- 
fied parameters) that results in splitting of the connected 
components. Both transformations facilitate tradeoffs 
between the accuracy and speed of the inference step. 
Fido also allows an application of advanced probabilistic 
inference algorithms, e.g. the junction tree algorithm, 
which significantly improve protein inference on large 
graphs. 

There are two major differences in the Bayesian network
models used by Fido and MSBayesPro. First, unidentified
peptides are ignored in Fido and a sequence-independent
parameter $\alpha$ is used as a replacement for peptide
detectability (Figure 6). Hence, the resulting Bayesian network is
simpler and inference is faster. Second, another parameter
$\beta$ is introduced to the model, which is the prior probability
for a peptide to be identified from an artificial "noise"
node. This addresses the situation where input peptide
probabilities are not accurate (e.g. many incorrect peptides
are assigned high probability). We believe this is a legitimate
remedy for failures that can occur during peptide
probability estimation. However, parameter $\beta$ seems
to be redundant given that $(1 - \alpha)^{|N(j)|}$ is the probability
for a peptide $p_j$ to be identified from "noise". The authors
indeed observed strong inverse correlation between the
optimal values of $\alpha$ and $\beta$.

One limitation of the Fido model is that it requires a
decoy (randomized) database to find the best values of the
parameters ($\alpha$, $\beta$, and $\gamma$, the prior for the presence of
proteins) by combining an ROC optimization (in a supervised
manner) with FDR estimation. Some versions of this
approach may lead to overly optimistic performance estimates.
A decoy database-independent maximum likelihood
approach may be an alternative for fitting the parameters.
Finally, the parameter optimization step dramatically
increases the run time of the algorithm (up to 2000 times),
which compromises the overall speed of the method [71].
Other probabilistic approaches 

Yang et al. recently investigated protein inference from 
an information retrieval (IR) point of view [68]. This 
work is interesting because it leverages methods in the 
IR field to the protein identification problem in proteo- 
mics. The authors found that the Prob-OR score, which 
is similar to ProteinProphet without the two adjustment 
steps, is dramatically worse than Prob-AND score, 
which is related to the protein posterior probabilities 
computed by MSBayesPro if degenerate peptides were 
treated as unique to each parent protein. We emphasize 
that the IR method proposed by Yang et al. is inherently 
a ranking approach rather than an inference approach; 
hence, it does not directly address the shared peptide 
issue as do the other probabilistic approaches discussed 
above. 

Gerster et al. [69] recently reported a new probabilistic 
approach, Markovian Inference of Proteins and Gene 
Models (MIPGEM), that is similar to MSBayesPro and 
Fido. MIPGEM models peptide probabilities as random 
variables as in some previous approaches [66] and 
assumes conditional independence between peptide 
scores given their parent proteins (Markovian assump- 
tion). Similar to the Fido model, MIPGEM does not con- 
sider peptide detectability or unidentified peptides 
although the authors suggested that including detectabil- 
ity would be a future consideration. Table 2 provides a 
summary of the major probabilistic inference methods. 
Several other methods are reviewed in [35-37]. 

Discussion 

Our main goal in this review was to present the chal- 
lenges, intuition and proposed solutions to the protein 
inference problem. With increased throughput of proteo- 
mics experiments, the tools and approaches presented 
here will have increasingly more important applications 
to many problems in biology and biomedical sciences. 
These applications include inference and verification of
gene models and identification of splice forms or
post-translationally modified sites. Some of these problems can
only be addressed using proteomics techniques and, as 
such, proteomics holds great promise in systems biology, 
biomarker discovery, diagnostics, prognostics and treat- 
ment monitoring. 

Undoubtedly, there is a need for more sophisticated 
methodology for protein inference, unbiased performance 
evaluation of these techniques, as well as stand-alone tools
with graphical user interfaces that will facilitate the transition
from research environments to practice in biomedical 
sciences. We conclude this paper by discussing the current 
issues in evaluating protein inference algorithms and then 
speculating on the ideal protein inference approaches. 



Evaluation of protein identification methods 

Despite the development of computational protein iden- 
tification methods, objectively evaluating the perfor- 
mance of the methods remains a problem. Two strategies 
are currently available: the use of standard samples (mix- 
tures of known proteins) and the use of decoy protein 
sequences to estimate FDR at the protein level. Both 
approaches have limitations. 

To date, only a limited number of standard samples 
[78-80] containing 10-50 proteins have been used to 
facilitate evaluation of peptide/protein identification. The 
advantage of using standard samples is that the truth is 
known; thus, the accuracy measures, e.g. precision and 
recall, of protein identification can be directly computed. 
However, standard samples are frequently plagued by 
contaminant proteins and the boundary between true 
and false protein identification is blurred. Another limita- 
tion of standard samples is their small number of pro- 
teins, which leads to difficulties in assessing statistical 
significance in method comparisons. 

The second approach estimates protein-level false dis- 
covery rates with the help of decoy databases. Although 
the approach has been used in several studies [51,52], 
two serious problems of the approach are generally 
ignored. We suggest that using decoy databases for eva- 
luation of protein identification algorithms should be 
approached with these limitations in mind. First, unlike 
the decoy (e.g. reversed, randomized) database approach 
for peptides, the decoy database for proteins does not 
produce the correct estimation of the number of incor- 
rect protein identifications when the correct proteins 
comprise a significant portion of the database. In an 
extreme scenario, when all proteins in the database are 
present in the sample, all the identified proteins from the 
forward database are correct despite many peptides being
incorrect identifications. On the other hand, all identified
proteins from a decoy database are incorrect. Thus,
using a decoy directly will produce a non-zero FDR, 
while FDR = 0 is the correct answer. 

Table 2 A comparison between different probabilistic protein inference algorithms.

| Methods | ProteinProphet | MSBayesPro | Fido | MIPGEM |
|---|---|---|---|---|
| Underlying graph structure | Bipartite graph with identified peptides and matching proteins¹ | Bayesian network with all peptides from proteins with at least one identified peptide | Bayesian network with identified peptides and matching proteins | k-partite graph with identified peptides, matching proteins and (optionally) matching gene models² |
| Inference algorithm | EM (Expectation Maximization) like | 1) Exact³; 2) Memorizing-Gibbs sampling | 1) Exact³; 2) Pruning approximation | 1) Exact³; 2) Direct sampling |
| Input | Probabilities for peptides with user-defined cutoff for p (often p > 0.05 is used) | Likelihood ratios for peptides with p > 0.05 and peptide detectabilities | Likelihood ratios for peptides with p > 0.05 | Probabilities for peptides with user-defined cutoff for p (often p > 0.05 is used; 0.9 for best performance) |
| Output | 1) Protein probabilities; 2) Protein group probabilities; 3) NSP adjusted peptide probabilities | 1) MAP solution, protein abundances and probabilities; 2) Protein group probabilities; 3) Posterior peptide probabilities | 1) Protein probabilities; 2) Protein group probabilities | 1) Protein probabilities; 2) Gene model probabilities⁴ |
| Protein prior estimation | No protein priors | Direct frequency estimation based on protein posterior probabilities in one run of MSBayesPro | Grid search optimizing cross-validation performance through multi-runs of Fido with different priors | Grid search optimizing model likelihood through multi-runs of the MIPGEM with different priors |
| Peptide probability adjustment by | NSP from a parent protein | Protein quantity adjusted peptide detectability | Two detectability-like parameters α, β | Treating peptide identifications as random variables |
| Protein grouping | Yes | No (indistinguishable proteins are resolved) | Yes | No (indistinguishable proteins are not resolved) |
| Peptide charge | Considered | Ignored | Considered | Considered |

1. For ProteinProphet, the underlying bipartite graph does not correspond to a Bayesian Network although it guides the EM-like algorithm through inference.

2. MIPGEM uses a rule-based protein removal scheme to simplify the network structure.

3. Exact computation is used only for small connected components.

4. Gene centric proteomics was proposed in [77], and implemented earlier in a deterministic way in [67].

This problem can be addressed by correcting for the
bias due to the number of true proteins in the forward
database. Let the number of identified forward and
decoy proteins be $n_F$ and $n_D$, and the total number of
forward and decoy proteins in the databases be $N_F$ and
$N_D$, respectively. Let the protein level FDR in the forward
database be $FDR_F$ and the rates of incorrect protein
identifications from the forward and decoy databases be

$\gamma_F = \frac{FDR_F \cdot n_F}{N_F - (1 - FDR_F) \cdot n_F}$

and

$\gamma_D = \frac{n_D}{N_D},$

respectively. An assumption regarding a decoy database
is that the rates of the false protein identifications are
identical; hence, $\gamma_F = \gamma_D$. By solving this equation we find

$FDR_F = \frac{n_D \cdot (N_F - n_F)}{n_F \cdot (N_D - n_D)}.$

Note that there is a correction factor
$(N_F - n_F)/(N_D - n_D)$ in this equation compared to the
FDR formula used for peptides. Also, when
$N_F = n_F$, $FDR_F = 0$ as expected. A related correction is
implemented in the MAYU approach [50] developed for
FDR estimation from large proteomics data sets, i.e. the
case when $n_F/N_F \gg 0$. Further corrections may be
needed if the average lengths of the identified vs.
non-identified proteins are different.
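
The corrected protein-level FDR can be computed directly from the four counts. The Python function below is a minimal sketch of that formula; the function name, argument names and example counts are illustrative assumptions.

```python
def corrected_protein_fdr(n_f, n_d, N_f, N_d):
    """FDR_F = n_D * (N_F - n_F) / (n_F * (N_D - n_D)).

    n_f, n_d: identified forward and decoy proteins;
    N_f, N_d: total forward and decoy proteins in the databases.
    """
    if n_f == 0:
        return 0.0  # nothing was identified, so nothing can be false
    if N_d == n_d:
        raise ValueError("all decoy proteins identified; estimate undefined")
    return n_d * (N_f - n_f) / (n_f * (N_d - n_d))


# Hypothetical counts: 1000 forward and 50 decoy hits, 20000 entries each.
naive = 50 / 1000
corrected = corrected_protein_fdr(1000, 50, 20000, 20000)
print(naive, corrected)
```

For these counts the uncorrected ratio is 0.05, while the corrected estimate is slightly lower, and the estimate drops to 0 when every forward protein is identified.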

We would like to point out that, for probabilistic protein 
inference algorithms, theoretical protein FDR values can 



be computed based on the protein posterior probabilities. 
However, such theoretical FDR values are only accurate 
when the reported protein posterior probabilities are accu- 
rate. Hence, they need to be evaluated themselves, e.g. 
against the target/decoy-based empirical FDRs. 
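
One common way to obtain such a theoretical FDR, shown here as a hedged sketch rather than the procedure of any particular tool, is to average the posterior error probabilities (1 minus the posterior) of the proteins accepted at a given threshold. The function name and the example probabilities are assumptions.

```python
def posterior_fdr(protein_probs, threshold):
    """Expected fraction of false proteins among those with posterior >= threshold."""
    accepted = [p for p in protein_probs if p >= threshold]
    if not accepted:
        return 0.0
    return sum(1.0 - p for p in accepted) / len(accepted)


# Hypothetical protein posterior probabilities.
probs = [0.99, 0.95, 0.90, 0.40]
fdr_at_09 = posterior_fdr(probs, 0.9)
print(fdr_at_09)
```

The estimate is only as trustworthy as the posteriors themselves, which is why it should be checked against an empirical target/decoy estimate.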

The second and more serious issue for applying the 
decoy approach is related to the existence of protein 
families. In fact, to our knowledge, no solution has yet 
been proposed. Simply speaking, a randomized database 
cannot serve as a good decoy for evaluating methods on 
data sets that contain many degenerate peptide identifi- 
cations. The reason is that such peptides are typically 
shared among forward proteins, which could be similar 
to each other due to biological/annotation reasons, but 
not with decoy proteins. As a result, a randomized pro- 
tein database cannot provide indications whether the 
identifications made among homologous proteins are
correct or not. For this reason, a randomized decoy
database is expected to underestimate FDRs for eukaryotic
samples, which have a large number of shared peptides
(Figure 1). The problem might be addressed using
a well-constructed non-random sequence database or
using a closely related proteome database as decoy.
Evaluating protein inference algorithms using such non- 
random decoys, however, remains a research problem. 

We emphasize that both standard mixtures and the tar- 
get/decoy approach for complex samples have their pros 
and cons in evaluating protein inference algorithms, and 
they are not mutually exclusive approaches. In fact, stan- 
dard mixtures can be used to validate the target/decoy 
approach for protein FDR estimation. It is generally a 
good idea to use both strategies for a more complete and 
objective evaluation. 

A need for guidelines for comparisons between methods

Due to the complexity of protein inference, fair evaluation 
of the proposed methods has been challenging. This is due 
to two major aspects. First, reliable and objective valida- 
tion of the protein identification results is itself a challen- 
ging problem, as the FDR estimation is still unreliable. In 
addition, it is not even obvious how to compare models 
whose outputs are considerably different, e.g. those that 
provide protein groups and those that resolve ties between 
all proteins. Second, due to the lack of agreed upon guide- 
lines, avoidable unfair comparisons are sometimes seen in 
the literature [69]. In other works, different peptide identi- 
fication algorithms or scoring schemes are sometimes 
used as inputs to different protein inference methods, 
making the protein inference comparisons uninterpretable. 

In order to address this situation, we tentatively pro- 
pose the following principles for comparisons of protein 
inference algorithms. First, whenever possible, the same 
or equivalent peptide identification scores as input to dif- 
ferent programs should be used. Second, effort should be 
made to provide inputs most appropriate to each algo- 
rithm considered. For example, algorithms that take all 
peptide identifications should be provided all scores, 
while programs that take only confident identifications 
should be provided such a subset. Third, at least one 
standard protein mixture data set should be used and all 
known proteins (whether they belong to "indistinguish- 
able" protein groups or not) in such data sets should be 
included in the evaluation of the protein inference meth- 
ods. This will allow the evaluation of protein inference 
algorithms on proteins identified without any unique 
peptides. Finally, and in an ideal scenario, large data sets 
from complex samples of unknown proteins should also 
be used to compare different programs; however, we cau- 
tion that the current decoy database strategy may not 
provide reliable FDR estimates at the protein level (eva- 
luation for protein data sets with significant fraction of 
degenerate peptides is a particular problem). 



The ultimate protein inference approach 

Despite the amount of published work, the protein 
inference problem is far from solved. We believe two
aspects are crucial to future approaches. First, the
model should be probabilistic, with degenerate peptides
treated in a principled way. Second, unidentified
peptides should be exploited with peptide detectability 
incorporated into the model, perhaps adjusted to allow 
modeling peptide competition at the elution stage in a 
given sample. Despite the current limitations of peptide 
detectability predictions, especially for non-tryptic and 
modified peptides, it is believed that including detect- 
ability [24,35,69,71] or peptide-specific information for 
peptide probability adjustment [21] would improve the 
current methods for protein inference. 
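
As a minimal sketch of how unidentified peptides and detectability could enter a probabilistic model, the toy example below computes the posterior probability that a single protein is present. It is not the model of any cited method. The function name, the conditional independence of peptides, and the `false_rate` parameter are illustrative assumptions: a peptide of a present protein is assumed to be observed with probability equal to its detectability, and a peptide of an absent protein with probability `false_rate`.

```python
def protein_posterior(prior, detectabilities, observed, false_rate=0.01):
    """Posterior P(protein present | peptide observations) in a toy model.

    Assumptions (illustrative, not taken from the paper):
      * peptides are conditionally independent given the protein state;
      * a peptide of a present protein is observed with probability
        equal to its detectability;
      * a peptide of an absent protein is observed with probability
        false_rate (a hypothetical false identification rate).
    """
    like_present = 1.0
    like_absent = 1.0
    for d, y in zip(detectabilities, observed):
        like_present *= d if y else (1.0 - d)
        like_absent *= false_rate if y else (1.0 - false_rate)
    numerator = prior * like_present
    return numerator / (numerator + (1.0 - prior) * like_absent)


# One peptide identified, one peptide missing. The missing peptide is
# highly detectable in the first case and poorly detectable in the second.
p_missing_high = protein_posterior(0.5, [0.9, 0.8], [True, False])
p_missing_low = protein_posterior(0.5, [0.9, 0.1], [True, False])
```

Under these assumptions, failing to observe a highly detectable peptide lowers the posterior more than failing to observe a poorly detectable one, which is the kind of evidence that models ignoring unidentified peptides cannot use.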

Furthermore, we believe that better estimation of
peptide/protein quantity might also help protein inference,
for example by improving the quantity adjustment of
peptide detectability [60,61] and by providing additional
input information for protein inference. As mentioned
in the Introduction, protein inference can be viewed as 
a special case of protein label-free quantification. In fact, 
an ideal inference algorithm should automatically be a 
quantification algorithm, and vice versa. We believe 
much better performance can be achieved by combining 
the protein inference and quantification tasks into one 
statistical framework. 

Algorithmic development is equally important for rig- 
orous and yet practical probabilistic inference. Serang 
et al. [76] proposed an approximate solution by setting 
low peptide probabilities to zero and then applying the 
graph pruning procedure. In this way the complexity of
the problem can be controlled at arbitrarily low levels,
at the price of potentially high error (i.e. the computed
probabilities may deviate greatly from the exact
values). The Gibbs sampling approach implemented in
MSBayesPro can achieve arbitrarily high accuracy in 
probability estimation; however, the time required for 
the inference can be prohibitively long. A fast algorithm
with a controllable error bound is desirable. Applying
well-established exact or approximate graph inference 
algorithms, e.g. the junction tree algorithm [76], is an 
important direction for further investigation. 
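
The trade-off between exact and sampling-based inference can be illustrated on a very small protein-peptide graph. The sketch below is not the implementation of MSBayesPro or of the method of Serang et al.; the graph, the noisy-OR peptide model, and the parameters `PRIOR`, `EMIT` and `NOISE` are all hypothetical choices made for illustration. Exact marginals are obtained by enumerating all protein configurations, which is exponential in the number of proteins, while a Gibbs sampler approximates the same marginals with an error that shrinks as more sweeps are run.

```python
import itertools
import random

# Hypothetical toy instance: each peptide maps to the proteins containing it.
PEPTIDE_PARENTS = {"p1": ["A"], "p2": ["A", "B"], "p3": ["B", "C"]}
OBSERVED = {"p1": True, "p2": True, "p3": False}
PROTEINS = ["A", "B", "C"]
PRIOR, EMIT, NOISE = 0.5, 0.9, 0.01  # assumed model parameters


def joint(state):
    """Joint probability of a protein configuration and the observations."""
    p = 1.0
    for prot in PROTEINS:
        p *= PRIOR if state[prot] else 1.0 - PRIOR
    for pep, parents in PEPTIDE_PARENTS.items():
        n_present = sum(state[q] for q in parents)
        # Noisy-OR: the peptide is missed only if the noise source and
        # every present parent protein all fail to produce it.
        p_obs = 1.0 - (1.0 - NOISE) * (1.0 - EMIT) ** n_present
        p *= p_obs if OBSERVED[pep] else 1.0 - p_obs
    return p


def exact_marginals():
    """Posterior protein marginals by enumerating all 2^n configurations."""
    totals = {prot: 0.0 for prot in PROTEINS}
    z = 0.0
    for bits in itertools.product([0, 1], repeat=len(PROTEINS)):
        state = dict(zip(PROTEINS, bits))
        w = joint(state)
        z += w
        for prot in PROTEINS:
            totals[prot] += w * state[prot]
    return {prot: totals[prot] / z for prot in PROTEINS}


def gibbs_marginals(n_sweeps=20000, burn_in=1000, seed=0):
    """Posterior protein marginals estimated by Gibbs sampling."""
    rng = random.Random(seed)
    state = {prot: rng.randint(0, 1) for prot in PROTEINS}
    counts = {prot: 0 for prot in PROTEINS}
    for sweep in range(n_sweeps):
        for prot in PROTEINS:
            # Resample one protein conditioned on all the others.
            state[prot] = 1
            w1 = joint(state)
            state[prot] = 0
            w0 = joint(state)
            state[prot] = 1 if rng.random() < w1 / (w0 + w1) else 0
        if sweep >= burn_in:
            for prot in PROTEINS:
                counts[prot] += state[prot]
    kept = n_sweeps - burn_in
    return {prot: counts[prot] / kept for prot in PROTEINS}


exact = exact_marginals()
approx = gibbs_marginals()
```

On a graph this small the enumeration is trivial, but its cost doubles with every additional protein in a connected component, whereas the cost of the sampler grows with the number of sweeps needed to reach a target accuracy.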

Additional material 



Additional file 1: Peptide detectability. 